Analytical description of finite size eff'ects for RNA secondary structures 
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The ensemble of RNA secondary structures of uniform sequences is studied analytically. We 
calculate the partition function for very long sequences and discuss how the cross-over length, 
beyond which asymptotic scaling laws apply, depends on thermodynamic parameters. For realistic 
choices of parameters this length can be much longer than natural RNA molecules. This has to be 
taken into account when applying asymptotic theory to interpret experiments or numerical results. 
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I. INTRODUCTION 

Folding of biopolymers is a fundamental process in 
molecular biology without which life as we know it would 
not be possible. In biopolymer folding, well-characterized 
interactions between individual monomers make a poly- 
mer fold into a specific structure believed to minimize 
the total interaction free energy. The apparent simplic- 
ity in the formulation of this biopolymer folding problem 
is in sharp contrast with the immense challenges faced 
in actually describing biopolymer folding quantitatively 
caused by the intricate interplay of monomer- monomer 
interactions and the constraint that the monomers are 
connected into a chain of a certain sequence. The bi- 
ological importance of biopolymer folding paired with 
this immense intellectual challenge has sparked numerous 
computational and theoretical studies These studies 
do not only attempt quantitative predictions of specific 
structures but also focus on more fundamental proper- 
ties of the biopolymer folding problem such as its phase 
diagram. 

While the bulk of the work concentrates on the folding 
of proteins due to its overwhelming importance in phar- 
maceutical applications, recently RNA folding has been 
identified as an ideal model system for biopolymer fold- 
ing p, 3]. RNA is a biopolymer of four different bases 
G, C, A, and U. The most important interaction among 
these bases is the formation of Watson- Crick (WC) base 
pairs, i.e., A-U and G-C pairs. This comparatively sim- 
ple interaction scheme makes the RNA folding problem 
very amenable to theoretical approaches without losing 
the overall flavor of the general biopolymer folding prob- 
lem. Again, a lot of effort has been devoted to under- 
standing fundamental properties of RNA folding such as 
the different thermodynamic phases an ensemble of RNA 
molecules can be in as a function of temperature, an ex- 
ternal force acting on the molecules, and the sequence 
design iiillii. 

All these theoretical approaches are concerned with the 
phase behavior of RNA molecules in the thermodynamic 
limit. In order to compare these theoretical predictions 
with numerical or actual biological experiments it is thus 
important to know which role finite size effects play, i.e., 
at which size of a molecule the universal predictions of the 
asymptotic theories are expected to hold. In this publica- 



tion we precisely aim to answer this question. We study 
homogeneous RNA sequences which allows us to analyt- 
ically solve for the universal asymptotic behavior as well 
as the cross-over length below which the universal theory 
is not applicable any more. We find that this cross-over 
length is very strongly dependent on the sequence of the 
molecule. For realistic energy parameters we find that 
the cross-over length can be as large as 10,000 bases. This 
is about the largest size of naturally occurring RNAs as 
well as the largest length of RNA molecules amenable to 
quantitative computational approaches. Thus, we con- 
clude that finite size effects have to be seriously taken 
into account when describing RNA folding by asymptotic 
theories. 

This article is organized as follows: In Sec. II, we briefly 
review the definition of RNA secondary structure. In 
Sec. Ill, we analytically derive the finite size effects of 
the simplest model of RNA folding namely a homoge- 
neous sequence without loop entropies. While this model 
is mainly treated for pedagogical purposes, in Sec. IV, we 
sketch how the result can be generalized to more realis- 
tic models of RNA folding. In Sec. V, the behavior of 
the cross-over length A'o is discussed. We find that Nq 
depends mostly on the degree of branching of the RNA 
molecules and a simple approximate formula is derived. 
These results are shown to be consistent with the numer- 
ical values obtained using experimentally known energy 
parameters for specific sequences in Sec. VI. We point out 
how enormous finite size effects in the RNA secondary 
structure formation problem can be. The detailed deriva- 
tions of the partition function and the cross-over length 
are relegated to various appendices. 



II. REVIEW OF RNA SECONDARY 
STRUCTURES 

A. Definitions 

RNA usually occurs as a single-stranded polymer with 
four types of monomers (bases) G, C, A, and U. The 
strand can bend back onto itself and form helices con- 
sisting of stacks of stable Watson-Crick pairs (A with U 
or G with C). 

An RNA secondary structure describes which bases are 
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I — stem — I 

FIG. 1: An RNA secondary structure. The thick line stands 
for the backbone of the molecule and thin lines stand for base 
pairings. The solid dots represent monomers. 5' and 3' show 
the head and tail of this RNA of length 58. Many different 
loops formed when RNA folds are also defined in the figure. 



bound and can be written as a set of binding pairs 
where i and j denote the z*'' and j*** base of the RNA 
polymer respectively. For example, the secondary struc- 
ture S of Fig. n is written as 

S = {(2, 57), (3, 56), (j, j), (fc,0, (37,44)}. 

In this study, we apply the common approximation to 
exclude the so-called pseudo-knots 2], i.e., for two base 
pairs and (fc, I), the configurations i<k<j<l and 
k <i <l < j are not allowed. As a result, the analytical 
studies become more tractable. 




FIG. 2: Pseudo-knots in RNA structures: The base pairings 
indicated by the arrow in a) create a pseudo-knot, b) the 
short pseudo- knots (called "kissing hairpins"), c) the long 
pseudo-knots in 3 dimensions. 

This exclusion of pseudo-knots is reasonable. For long 
pseudo-knots, the double helix structure would require 
threading one end of the molecule through its loops many 
times so they are kinetically difficult to form. Thus, these 
pseudo-knots occur infrequently in natural RNA struc- 
tures ;3!? ]. Short pseudo-knots, on the other hand, do 
not contribute much to the total free energy because of 
the relatively low binding energies for short sequences 
and the strong electrostatic repulsion of the backbone 
since the polymer backbone is highly charged. By ex- 
cluding pseudo-knots, we will stay close to commonly 
used algorithms that compute the exact partition func- 
tion which can be applied to test our model. 



B. Interaction energies 

Since the tertiary interactions between structures are 
in general much weaker than the interactions among the 
secondary structures 0i 13 1 we will follow the common 
approaches and take into account only the energy contri- 
bution from the secondary structures. 

If we assign a Gibbs free energy AG(S) to each sec- 
ondary structure S, the partition function of the ensem- 
ble of all structures is given by 

Z = ^e-'^«(«). (1) 
s 

The Gibbs free energy is commonly used to describe 
the secondary structure since it contains entropic contri- 
butions from the formation of loops as well as enthalpic 
terms from the formation of base pairs. The total free 
energy G(S) is the sum of the energy contributed from 
each elementary piece such as the stacking of base pairs 
and the connecting loops. The largest contribution are 
the stacking energies between adjacent WC pairs, and 
these energies depend on the type of bases in the pairs. 
While the typical value of the stacking energy is on the 
order of ksT at room temperature, both the enthalpic 
and entropic terms are on the order of lOfc^T. Thus, the 
stacking energy will become repulsive with a moderate 
increase of temperature to around 80° C and the RNA 
molecule denatures. 



III. MOLTEN PHASE 

In order to get a qualitative understanding of finite 
size effects, we first follow previous works |E 13 IE IE ^ 
and assume that the Gibbs free energy is the sum of the 
binding free energies of each base pair in the structure, 

AG(S)= J2 ey", (2) 
(»,i)es 

and neglect the entropic energies due to the formation of 
loops for the rest of this section. 

The binding free energies in this model are differ- 
ences between the gain in chemical binding energy and 
the loss in the configurational entropy associated with 
the formation of the base pairs. Since both contributions 
are large and comparable, the realistic values of the eij 
strongly depend on temperature. Since we do not de- 
scribe spatial degrees of freedom in this model, it does 
not describe denaturation of the RNA molecule and we 
restrict ourselves from here on to a parameter regime 
where the majority of the bases is paired, i.e., where a 
significant fraction of the is negative. 

In order to obtain analytical insights into the finite 
size effects, we additionally assume that the binding free 
energy eij is a constant epj independent of the identities 
of the bases. Thus, in our simplified model AG(S) = 
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FIG. 3: Recursive relation for a simple model of an RNA 
molecule. The wavy lines stand for undetermined structure 
and the corresponding partition function. The arch repre- 
sents a binding pair between bases k and N. The assumption 
of excluding pseudo-knots separates the last term into two in- 
dependent parts since two pairs can not go across each other. 



eq X |S| where |S| stands for the number of pairs in S. 
This simplified energy model serves as the basis of our 
study for the more realistic energy model. 

As it stands, this energy model and the more realistic 
energy model we will introduce later describe only homo- 
geneous sequences. However, it has been argued that this 
energy model can be applied to random RNA sequences 
at high enou gh temperature when the disorder is suffi- 
ciently weak '1(J|. Under this weak disorder, there exist 
many structures with nearly degenerate energies and the 
corresponding scaling laws match the predictions of the 
simplified energy model. Only as the temperature is low- 
ered, a strong disorder phase arises. This low tempera- 
ture phase is characterized by a small number of distinct 
low-energy structures, and is referred to as the "glass 
phase" in analogy with studies of other disorder system. 
However, this glass phase is not within the scope of this 
article. 

The partition function of the molten phase model can 
be obtained through the recursive relation in Fig. 13 This 
figure shows how the possible ways of binding can be 
decomposed into two cases where the last base N is either 
unbound or bound to some base k. If we define G{N+1) 
as the partition function of an RNA of length N, the 
relation reads 

N-1 

G{N + l)^G{N)+Y,G{k)-q-G{N-k), (3) 

k=l 

where q = e~^'^°. Together with the boundary condition 
G(l) — 1, this equation allows calculation of the exact 
value of the partition function G{N) in 0{N'^) time. The 
Vienna package is able to calculate this exact value 
with more complete sequence dependent energy param- 
eters based on the similar scheme. This recursive equa- 
tion (3) also leads us to the analytical expression for the 
partition function. By introducing the z-transform. 



G{z) ^ J2 G(^) 



(4) 



N=l 



and applying it to Eq. (3), we get a quadratic equation 
for G{z) as 



zG{z) - 1 = G{z)+qG^{z), 



(5) 



from which G{z) can be solved. For large sequence 
lengths, the partition function G{N) is obtained by per- 



forming the inverse z-transform on G{z), and can be ap- 
proximated (see Appendix A) as 



G{N) ^ G{z)z 
2m ' 



N~l 



dz. 



(6) 



A{q)N-'z^{q)[l^t:^ + 0{^)l (7) 



where Zc{q) = 1 + 2^ is the branch point of G(z), 
e = 3/2, A{q) = [(1 + 2Vg)/W/2]i/2 and = 
3(1 + 4y^)/16Y^. This asymptotic analytical formula is 
only determined by the behavior of G{z) near the branch 
point Zc- The exponent = 3/2 indicates the character- 
istic universal behavior of this partition function for long 
sequences. The non- universal cross-over length A^o('z) 
characterizes how long a sequence has to be for the uni- 
versal laws to hold. Here, we find an explicit analytical 
formula for Nq as a function of parameter q. 

From the formula of No{q), we can see that the cross- 
over length in this simple model is on the order of 1 for 
all values of q. However, in the following sections we 
will show that the cross-over length may vary over sev- 
eral order of magnitudes from several bases to 1,000,000 
bases. Thus, the loop entropies of the more realistic en- 
ergy model would greatly modify the behavior of cross- 
over length. 

IV. INCLUDING LOOP ENTROPIES 

To get a more quantitative understanding of the cross- 
over length, we now take into account the loop entropies 
and introduce Boltzmann factors, s, &, i, h, and m, for the 
different types of loops (see Fig.P). The values of these 
free energy parameters have been carefully measured [l5l | 
such that our model can be applied quantitatively to re- 
alistic RNA molecules. Typically, the free energy of a 
stacking loop is large and negative (s ^ 1) while the free 
energies for all the other loops tend to be large and pos- 
itive, leading to Boltzmann factors much less than one. 
The binding energy eg of the simple model introduced 
above is now absorbed into these loop free energies. As 
mentioned in Sec. HI, we still restrict ourselves to a tem- 
perature regime below denaturation which is determined 
by the true energy model. 

Again, we want to calculate the partition function of 
the structure ensemble and derive the cross-over length 
as a function of the loop parameters. This calculation in 
principle follows along the lines of Sec. Ill, but is techni- 
cally much more elaborate because of the more compli- 
cated energy model. A reader more interested in the final 
results than in the technical details is advised to directly 
skip to the Sec. V. 

In order to calculate the partition function, we separate 
the secondary structure into two categories as shown in 
Fig. ^ One is the bubble structure which contains only 
hairpins and multiloops. The other is the stem struc- 
ture, which connects the bubbles, containing only stack- 
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FIG. 4; Separation of stems from the bubble structure. 



structures are not allowed, e.g., since a base can not be 
shared in two base pairings, a stem with 3 bases does not 
exist and this leads to S{2) = 0. Also when the length of 
a stem is small, certain loops which require many bases 
are not allowed, e.g., the only available structure for a 
stem with 4 bases is a stacking loop. Including these 
conditions, we apply the z-transform on Eq. (8) with the 
definition 



ing loops, bulges, and interior loops. Wc will study each 
of them individually and later combine them together. 



S{z) = J2 S{N)2 



N=l 



(9) 



A. Stem structure 

In principle, the loop free energy depends on the length 
of a loop. Thus, unbound bases also contribute to the 
total free energy. This contribution has been experimen- 
tally measured for small loops and behaves logarithmi- 
cally with length when the loop is large. However, in the 
following we show that the free base energy of unbound 
bases provides only a negligible effect on the behavior of 
a stem and thus on the cross-over length if the stacking 
energy is relatively large. 

To explore all possible ways of pair bindings, again a 
graphical recursion relation is helpful. Such a recursion 
relation is shown in Fig. \^ Starting from a closed pair 
on the left, the following loop can be either a stacking 
loop, a bulge or an interior loop which correspond to the 
terms on the right hand side. To study the influence of a 
free energy for unbound bases, we assign the Boltzmann 
factors, b and i, to each unbound base in a bulge and 
an interior loop respectively. If we deflne the partition 
function of stem structures with N bases and the first 
and the last bases of which is paired as S{N — 1), which 
corresponds to the left hand term, this relation in Fig. [S] 
is formulated as 



N-2 



S{N~~1) = sSiN -3) + 2'^bb 

k=3 

N-3 N-2 
a=3 b=a+l 



^-'=-iS'(fc - 2) 
S{b-a). (8) 



to resolve the convolution. Solving for S{z) gives us 

-I -1 



S{z) 



1 



2b -b 



\z~^b) z^{z-ty 



(10) 



Again the partition function S{N) can be obtained by 
applying the inverse z-transform on S{z). 

To illustrate the effect from the unbound bases, we 
show how their Boltzmann factors affect the average be- 
havior of the long stem structure. The average quantity 
of a certain type of loop or unbound base can be cal- 
culated as T • dr\nS{N) where r is the corresponding 
Boltzmann factor. Here, we specifically calculate the av- 
erage number of unbound bases per interior loop, which 
is defined by i ■ di\nS{N)/i ■ d^\nS{N) in the large N 
limit, as a function of i (see Appendix B for detailed 
calculations) . 
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FIG. 6: Average number of unbound bases per interior loop 
v.s. i for different stacking energy. Following the measured 
free energies [l^ . we chose the typical values b = 0.01, b = 
0.85, and i — 0.05 for the other Boltzmann factors. 



FIG. 5: Recursion relation for stem structure. The dashed 
lines stand for the undetermined structures. Thick lines rep- 
resent the backbone and thin lines stand for pair bindings. 

To perform the z-transform, we have to consider the 
initial conditions, 5(1) = 1, 5(2) = 0, 5(3) — s, and 
5(4) = 2bb. These initial conditions arise because certain 



From Fig. El we can see that this average number is 
barely affected by i when the Boltzmann factor s for a 
stacking loops is large. This can be easily understood as 
follows: For an interior loop, the unbound bases always 
introduce a strong energy penalty since less bases are 
available for stacking. This penalty is much larger than 
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the penalty due to loss of the degree of freedom by one 
more free base. Thus when the binding energy is large, 
the free base entropic penalty can be neglected. From 
Fig. El we can see that interior loops tend to stay at 
the smallest length (2 unbound bases) when s is large, 
independent of the value of i. Since the same argument 
applies to bulges as well, we will set i and 6 to 1 for the 
rest of this publication. 



B. Bubble structure 

In a similar fashion, a recursive relation for the bubble 
structure is found graphically as shown in Fig. [3 In 
the first relation for a closed bubble structure, we can 
have either a hairpin loop or a multiloop following from 
the closed pair at the end. In the second relation, the 
multiloop structure can be decomposed into two cases 
where the last base is either unbound or bound. Since 
a multiloop has to have at least 3 branches, we have a 
term with two more bubble structures; the last recursive 
term produces more branches. 
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FIG. 7: Recursion relation for bubble structures. Dashed lines 
stand for undetermined structures. Thick lines and thin lines 
stand for the backbone and the pair binding, respectively. 
In a), the left hand term represents an undetermined bubble 
structure with the two end bases paired. The double dashed 
line stands for an undetermined multiloop structure and it 
can be decomposed into the components in b), as explained 
in the main text. 

By defining the partition function for the closed bubble 
structure and multiloop structure with N bases as B{N — 
1) and M{N+1) respectively, the recursive relations read 



B{N-l) = tih + m- M{N -I)) 



(11) 



AT-l 



M{N + 1) = M{N) + ^ M{k) ■ B{N - k) (12) 
fc=i 

N-3 N-2 N-1 

+ E E E B{b-a)B{N-k). 

a=l b=a+l k=b+l 

Here an additional Boltzmann factor t is introduced in 
the first relation at the position where two bubbles are 
connected. Later we will insert stems into the bubble 
structure by replacing t with the partition function of 
the stem structure. We also neglect the free base energy 



for unbound bases in hairpins and multiloops for similar 
arguments as above. 

In this recursive relation, the smallest multioop should 
have at least 4 bases such that two branchings can be 
connected. Thus, we set the initial conditions as M(l) — 
M(2) = M{3) = Af(4) = 0, which forbids a multiloop 
with length less than 4. With the initial conditions, the 
recursive relations result in the following quadratic equa- 
tion for the z-transformed B{z) : 



{z-iy 



B'- 



z- 1 



h 



B + h = 0. (13) 



C. Complete structure 

To combine the stem and bubble structures, we insert 
a stem structure at each position represented by t which 
is a placeholder for the connections between multiloops 
and hairpin loops in the bubble structure. In this case 
the first relation in Fig. O is modified as indicated in 
Fig.lHl 
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FIG. 8: Replacing the position of t by stem structures. The 
left hand term represents the closed structure with both stem 
and bubble structures in it. 

By defining the partition function of the closed struc- 
ture on the left hand side as C{N — 1), the Fig. |H1 reads 



C{N-l) 



N-1 

E 

fc=i 



S{k) ■ B{N ~ k). 



(14) 



After z-transform, this relation results in (7 = {zS)B. 
Thus, the following replacement 



t 



B^C 



(15) 



combines stem and bubble structures together. In or- 
der to complete all possible structures, the single strands 
outside the closed end pair also have to be included. This 
can be done by going back to the first recursion relation, 
Eq. (3), and replacing G{N — k) by the closed structures 
C{N — fc), which relates G{z) to C{z) as 

1 



^ (z - 1) - C 
Putting everything together, we obtain 

-1 2m -/i' 



G 



1 
2m 



zS 



z-1 



(16) 



(17) 




z- 1 



Ah 
zS 



Ahm 



6 



Notice that the leading singularity in G is again from 
the branch cut induced by the square root. Thus, as ex- 
pected, the inverse z-transform leads to the same univer- 
sal behavior (7) with an exponent ^ = 3/2 as the simple 
model we studied first. However, no n- universal quanti- 
ties such as the cross-over length Nq will depend on the 
parameters of the extended model. 



D. Minimum hairpin length constraint 

For natural RNA molecules, a hairpin loop needs to 
have at least 3 unbound bases due to the width of the 
double helix (which implies j — z > 3 in the secondary 
structure). This minimum hairpin length constraint is 
easily taken into account in our calculations. Under the 
constraint, a bubble structure which contains at least one 
hairpin must have at least 5 bases. Thus, we adopt the 
initial condition, B{1) = B(2) = B{3) = 0, when we 
perform the z-transform on Eq. (11). The summation 
range in Eq. (11) is then changed and it simply leads to 
a substitution oi hhy h/z^ in all subsequent equations. 

This substitution is reasonable since z represents the 
Boltzmann factor for the free energy of one single base. 
The minimum hairpin length constraint reduces the num- 
ber of available bases for binding by 3, so it introduces 
an energy penalty 3 • ln2; which causes the Boltzmann 
factor h to be divided by z^. In this way, we can easily 
introduce any kind of ininimuni length constraint via a 
similar substitution. For example, if we require a bulge 
loop to have at least two unbound bases instead of one, 
the replacement of 6 by h/ z will include this constraint. 

Note, this principle also helps us to understand equa- 
tion (10) for the stem structure. The terms with different 
powers of z, namely s/z^, fo/z"^, and i/z^, arise because 
a stacking loop reduces the number of available bases for 
pairing by 2, a bulge loop has at least one unbound base, 
and an interior loop has a minimum of 2 unbound bases, 
respectively. 



V. BEHAVIOR OF THE CROSS-OVER LENGTH 

In this section, we will use the general results of Sec. IV 
in order to calculate the cross-over length in our model 
for sequences with loop entropies. 



A. Large stacking energy approximation 



over length since it is obtained under the homogeneous 
molten phase model. 

However, since this calculation involves finding the 
root of a fourth order polynomial, no meaningful analyt- 
ical expression can be found in general. Thus, we resort 
to a large s approximation in order to obtain an analyt- 
ical expression. This approximation is justified since the 
Boltzmann factor of a stacking loop, s, is usually much 
larger than 1 while the loop Boltzmann factors 6, i.h, and 
m are less than one. In this approximation, from Eq. (17) 
we find the branch point Zc ~ v^, i.e., the free energy per 
base is / = —kT ■ In(zc) « ^AG(s). This can be easily 
interpreted since we expect most bases to form pairs due 
to the favorable stacking loops such that the free energy 
per base is half of the free energy of a stacking loop. 



B. Cross-over length 

Including the minimum hairpin length constraint in- 
troduced in Sec. IV. D, we expand the branch point Zc 
near y/s. Then, the approximated analytical formula for 
the cross-over length becomes 



No 



+ 



3s3/4 



96 



11?; - 66 

4~s 



+ 0(s-3/2) 



(18) 



It has a straight forward interpretation: The simplest 
possible structure is a long stem with one hairpin at the 
end. Every additional branching of the structiire requires 
formation of one hairpin and one multiloop. Since upon 
formation of the hairpin and mulitloop, at least 3 bases 
become unbound, the Boltzmann factor for a branching 
is hm/s^^'^. The prefactor in Eq. (18) is up to the nu- 
merical factor of 3/8 the inverse square root of this ex- 
pression. Thus, we conclude that the cross-over length 
which becomes larger as the Boltzmann factor for a sin- 
gle branching becomes smaller can be interpreted as the 
minimum length that allows a certain degree of branch- 
ings. 

Since the Boltzmann factors 6 and i appear only in 
the higher order terms, they barely modify the cross-over 
length and can be neglected altogether. This is consistent 
with the fact that b and i only play roles in the stem 
structure, but not in the bubble structure. The leading 
term of the approximated analytical formula (18) will be 
referred to as the large s cross-over length. 



To derive the cross-over length, we need to solve for the 
branch point Zc that is defined by the vanishing of the 
term under the square root in Eq. (17). In principle, we 
can always obtain the numerical value of the branch point 
and expand G{z) around this point to obtain a numerical 
value for the cross-over length (see Appendix A) . We will 
refer to this numerical value as the homogeneous cross- 



C. Reliability of the large s approximation 

For the analytical large s cross-over length to be useful, 
we have to know how good they agree with the numeri- 
cal value of the homogeneous cross-over length. Here, we 
compare these two values for many different choices of 
energy parameters covering the whole range of realistic 
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values. Fig. 9 shows how the large s cross-over length 
approaches the homogeneous one as s gets large. Typi- 
cal values for the Boltzmann factor of a stacking loop s 
involving GC pairs are s > 30 'l5], so the approximation 
is very good in this region. For stacking loops involving 
AU pairs, s is around 5 so a deviation from the approxi- 
mated formula in the large s limit can be seen. However, 
since Nq only sets the order of magnitude of the length 
beyond which the asymptotic theory is applicable, the 
large s cross-over length with a deviation by a factor of 
2 at s = 5 is still a good estimation. 
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by the Vienna package. This size measures the average 
number of base pairs to be crossed when connecting the 
base to base 1 (Fig. ^j), which captures the size 
of the secondary structure. We expect I to obey 



/cX^V2.(l_(^)l/2 + 0(l) 



(20) 



where the leading term is the asymptotic behavior [T3 
and the next term reflects the first expected correction 
which is a constant independent of N. We determine / for 
sequences of different lengths and extract the full numer- 
ical cross-over length iVo by fitting data obtained via the 
Vienna package to Eq. (20). This is then compared to 
our large s cross-over length. 



A. {GCA)n sequence 
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Boltzmann factor s 

FIG. 9: Ratio between the large s cross-over length and 
the numerical value of the homogeneous one for many com- 
binations of the parameters: b,i = {0, 1} and h,m = 
{0.1,0.01,0.001}. These choices cover the region of realistic 
values. 



VI. NUMERICAL VERIFICATION 

While for a generic RNA sequence, the molten phase 
model is believed to only apply at sufficiently high tem- 
perature, it can be applied to repeated sequences at 
all temperatures below denaturation since each repeated 
unit can be viewed as the equivalent of a base in the 
molten phase fFig. I10|l . To illustrate the correctness of 
the calculations shown in Sees. IV and V, and to get a 
feeling for typical cross-over lengths, we now compare our 
large s cross-over length of repeated sequences with the 
full numerical results. The full numerical result is ob- 
tained by using the Vienna package 11] which can calcu- 
late the exact value of the partition function and other 
observables for arbitrary sequences using a realistic en- 
ergy model. 

As an observable, we choose the average size I of a 
structure. This quantity is defined as 

N/2 N 
k=l k'=N/2+l 

where Pk^k' is the probability that bases k and k' are 
paired. The latter probability is also calculated exactly 




FIG. 10: Equivalence between the {GCA)n sequence and the 
molten phase : Three consequent bases GCA are mapped to 
one single base so the GC/CG stack is equivalent to a binding 
pair in the molten phase. Then, the smallest interior loop on 
the left is viewed as a stacking loop and the following interior 
loop becomes a bulge in the molten phase. The hairpin loop 
on the right with 2 units of GCA is considered to have two 
unbound bases in the molten phase. 

We apply this scheme to a repeated {GCA)n sequence. 
Such sequences naturally occur in the gene for Hunting- 
ton's disease and their secondary structures are believed 
to play a role in this disease. Since the GC/CG stack 
in the secondary structure of the {GCA)n sequence is 
much more favorable than any other combination, we 
can exclude the possibility to have binding pairs other 
than GC/CG. Thus, GCA is viewed as one unit base in 
the molten phase. With this equivalence we can use the 
experimentally determined parameters [l^ and calculate 
the equivalent energy parameters s, 5, i, h, and m for the 
molten phase model. For example, the stacking energy of 
the molten phase is the sum of GC/CG stacking energy 
and the free energy for the interior loop of length 2 in 
the {GCA)n sequence. 



Fig.llllshows the full numerical results for the average 
size of the secondary structure for a (GCA)„ sequence 
as a function of the sequence length N. By fitting the 
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FIG. 11: Average size of the secondary structure of a {GCA)n 
sequence at 37° C. The data is fitted by the expected law (20). 

result to the Eq. (22), we get a full numerical value of 
the cross-over length of 6.9 bases. 

To compare this full numerical value with our large s 
cross-over length, we plug the equivalent energy param- 
eters of the molten phase model into the approximated 
formula 3s^^'^{y/s ~ 1)^/8%/ /im. This formula is different 
from equation (18) because the minimum hairpin length 
of the corresponding molten phase model only 1 instead 
of 3. The resulting large s cross-over length is about 
2.3 repeat units which corresponds to 6.9 bases in the 
{GCA)n sequence. This agrees very well with the full 
numerical value. 



B. {AU)„ sequences 

Repeated AU sequences have already been suggested 
as models for the molten phase by de Gennes in 1968 0. 
For such sequences, we exclude the possibility of A A or 
UU binding pairs since they are not favorable at all. In 
a similar fashion as for the {GCA)n sequence in Fig llUI 
the smallest bulge has two free bases. However, since 
the minimum hairpin length is 4 instead of 3, in order 
to match AU or UA closing pairs, the large s cross-over 
length is 3s{^/s — \)'^ /sVhm. 

Plugging in the correct values for the parameters at 
body temperature results in a large s cross-over length 
of Aq ~ 7700. A verification of this value is beyond the 
reach of the numerical procedure using the Vienna pack- 
age. Since the cross-over length is expected to decrease as 
the denaturation transition is reached, the full numerical 
verification could be performed at a higher temperature. 
Thus, we first study the the temperature dependence 
of the cross-over length Nq. To this end we choose to 
study the homogeneous cross-over length instead of large 
s cross-over length because it is not clear whether the 
large s approximation is appropriate when approaching 
the denaturation temperature. When studying the tem- 
perature dependence, it is important to note that the free 
energies themselves have a very strong temperature de- 
pendence which changes Boltzmann factors much more 



than the explicit (3 in the exponent. By looking up the ta- 
ble of enthalpies corresponding to different loops, we de- 
termine this temperature dependence of the Boltzmann 
factors s, h, i, h, and m. The numerical values of the 
homogeneous cross-over lengths thus obtained are shown 
in Fig. El From this figure, we choose to numerically 
verify the estimate of the cross-over length at 57°C. 



T 




: , I I I I I I L. 

20 40 60 80 

Temperature ( C) 



FIG. 12: Numerical values of homogeneous cross-over length 
for {AU)n sequence with respect to temperatures. This quan- 
tity is obtained by numerically calculating the branch point 
and then expanding G{z) around this point(see Appendix A). 



The full numerical verification is done in a similar way 
as Sec. VLA. Fitting numerical data at 57°C gives a full 
cross-over length Nq « 900. Our homogeneous cross-over 
length predicts a value Nq « 1500 which is of the same 
order of magnitude as the full numerical result. From the 
above numerical studies, we conclude that our theory for 
the large s cross-over length is a good estimate. Thus, the 
cross-over length of {AU)n at body temperature is truly 
around 7,700 bases, which is almost the largest size of 
naturally occurred RNA sequences. For {GC)n sequences 
at body temperature, the situation is even more dramatic 
with a predicted cross-over length about 105,000 bases. 
This is beyond the limit of most natural RNAs and sug- 
gests that the structure ensemble of {AU)n and {GC)n 
molecules can never be described by the asymptotic the- 
ory for any naturally available molecules. 

C. Distribution of cross-over length for short 
repeated sequences 

Lastly, we want to explore the sequence dependence of 
the cross-over length for short repeated sequences. We 
study all possible partially self-complementary repeated 
sequences of unit length two and three and evaluate their 
cross-over lengths using the formula for the large s cross- 
over length. 

For non self-complementary repeated sequences, e.g., 
the {GGC)n sequence, it is not a priori clear that we can 
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y/C\ y^Cv, yC-^ y^Cv, yC-^ 

G-C .G-C G-C G-C / G-C G-C G 

C-G. / C-G. ,C-G. ,C-G. C-G. ,C-G. C 
^C^ ^C^ ^C^ ^C^ ^C-^ ^C'' 

FIG. 13; A {GCC)n stem in the ground state. The dash hnes 
stand for base pairing of ground state. The dot lines show 
the possibilities for base G to pair with neighboring C. 



question can at this stage neither be addressed numer- 
ically nor analytically since even the expected asymp- 
totic behavior is elusive. Nevertheless, the existence of 
large cross-over length in the molten phase as established 
here suggests that also in the glass phase large cross-over 
length have to be considered as a serious possibility. 



map the repeated unit GCC onto one single base in the 
molten phase (Fig. \\?>\ since base G can pair with either 
one of the two bases C. However, we verified numerically 
that more than 99% of the G's are paired in the stacked 
configuration indicated by the dashed lines in Fig. ^| 
This is due to the large loss in free energy as one stack- 
ing loop is lost and a bulge loop is created. We verified 
this behavior for all the partially self-complementary se- 
quences in our discussion. Thus, we can safely use the 
same approach as used for the {GCA)n sequence and 
calculate the large s cross-over length. The results are 
shown in the following table. 



Seq 


UAA 


UAC 


GCA 


CCA 


GU 


GCU 


AUU 


AUG 


iVo 


4 


7 


7 


34 


40 


70 


115 


130 


Seq 


GCC 


CGU 


UAC 


AUG 


AU 


CCG 


GC 




No 


450 


2800 


3900 


5600 


7700 


10700 


105000 





TABLE I: Large s cross-over length of short repeated se- 
quences. 



We can see that the cross-over lengths spread over the 
whole range from only several bases to 10^ bases. This 
non universal behavior has to be treated individually for 
different sequences. For longer repeated units, we expect 
similar behavior but we did not verify this due to the 
complicated entropic effects. 



VII. CONCLUSION 

For our asymptotic description to be appropriate, an 
RNA sequence must be long enough to allow a certain 
degree of branching. The cross-over length at which this 
happens for homogeneous sequences is strongly sequence 
dependent. Specifically for cases when stacking is very 
favorable and hairpins and multiloops are unfavorable, 
the cross-over length iVg can be very large. In this re- 
gion, finite size effects are very important and have to 
be taken into account in any numerical simulation or in- 
terpretation of experimental data in terms of asymptotic 
formulas. Our analytical results provide an easy way to 
estimate this cross-over length for realistic energy mod- 
els. 

It remains as an interesting and relevant question how 
large the cross-over lengths for generic, non-homogeneous 
sequences in the glassy phase are. Unfortunately, this 



VIII. APPENDIX 



z-transform 



For a series of numbers {G(l), G(2), 
z-transform of G(N) is defined as 

oo 

G{z) = ^(^) 

N=l 



G(N), ...}, the 



-N 



To recover G{N), we apply the inverse z-transform: 



G{N) = ^ 



G{z) 



c 



As an example, here we solve for the partition func- 
tion for the molten phase model. From Eq. (5), G{z) is 
calculated as 

1 



G{z) = ^(^ - 1 - v/(^ - - 4<z), 

so the partition function G{N) can be obtained by plug- 
ging G(z) into the inverse z-transform. 

The analytical part (z — l) can be dropped since it does 
not contribute to the integral. The square root introduces 
a branch cut with two branch points at the ends. By 
taking the contour C as the smallest loop around the 
branch cut, the integral becomes 



GiN) 



1 



1 

2^ 



'(v/(z-l)2-4g + ie 
v/(z - 1)2 - 4q - ie) 



dz 



1)2 .z^-lrfz, 



where Zc — I + is the branch point with greatest 
real part and zq is the other branch point. In the large 
N limit, due to the exponent factor z^ , we expect that 
only the behavior near Zc is important, so we can expand 
the integrand around Zc and perform the approximation. 

To do this integral, we first consider a more general 
case where G{z) — ^ f{z). The behavior near the great- 
est value of z is obtained correctly if we replace z by 
and expand -\/ /(/i) in the power terms of (/ic — /i) as 



V-f'iP-c)itJ-c 

4/'(a^c) 



-M)^/2 
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Together with the following approximation vaHd in the 
large N limit, 

' (/Xc - iJ,)"e^^diJ, « r(l + a)N-^^+°'^e^'=^, 
the inverse z-transform of G{z) can be expressed as 



G{N) = 



TT 2 



1 



fii^c) r(5/2) 



Here, the partition function of the molten phase 

model (7) is obtained by setting f{^) = Aq — (e^ ^ 1)^- 
For the general case, we can put the result in terms of z 
with the substitutions 



dm 



dz 
df{z) 



dz 



2 



dz"^ 



After little algebra, the cross-over length is obtained as 



for arbitrary f{z). 



1 



df{z) I 
dz I 



model because there is no branch cut, so we do not ex- 
pect the universal behavior N^'^^'^. However, the fact 
that only the pole with the greatest real part is impor- 
tant in the large N limit is preserved. From Eq. (8), the 
z-transform of the partition function for the stem struc- 
ture S{z) has the form f{z)/g{z) where 



g{z) = z'^{z-h){z-if - s{z-b){z-if 
-2bb{z-if -i?{z-b). 



In the large N limit, the inverse z-transform can be ap- 
proximated by 



S{N) = ^.f S{z)z^-^dz 



1 



2-111 J g'{zc){z- Zc) 



where Zc = Zc{s,b,b,i,i) is the pole with greatest real 
part. 

Since z^ is the dominating term for large N, the aver- 
age quantity can be approximated as 



(r) = TdrHSiN)) 

« NrdrHzc) = Nz^^TdrZc. 



B. average length of interior loops 

The calculation of partition function for the stem 
structure S{N) is different from that for the molten phase 



So the average number of unbound bases per interior loop 
can be obtained by {idiZc) / {idiZc) ■ Here, the pole Zc can 
only be solved numerically. 
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